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Abstract 

Genes are often regulated in living cells by proteins called transcription factors (TFs) that 
bind directly to short segments of DNA in close proximity to specific genes. These binding 
sites have a conserved nucleotide appearance, which is called a motif. Several recent studies 
of transcriptional regulation require the reduction of a large collection of motifs into clusters 
based on the similarity of their nucleotide composition. We present a principled approach 
to this clustering problem based upon a Bayesian hierarchical model that accounts for both 
within- and between-motif variability. We use a Dirichlet process prior distribution that allows 
the number of clusters to vary and we also present a novel generalization that allows the core 
width of each motif to vary. 

This clustering model is implemented, using a Gibbs sampling strategy, on several collec- 
tions of transcription factor motif matrices. Our stochastic implementation allows us to exam- 
ine the variability of our results in addition to focusing on a set of best clusters. Our clustering 
results identify several motif clusters that suggest several transcription factor protein families 
are actually mixtures of several smaller groups of highly similar motifs, which provides sub- 
stantially more refined information compared with the full set of motifs in the family. Our clus- 
ters provide a means by which to organize transcription factors based on binding motif simi- 
larities, which can be used to reduce motif redundancy within large databases such as JASPAR 
and TRANSFAC, which aides the use of these databases for further motif discovery. Finally, our 
clustering procedure has been used in combination with discovery of evolutionarily-conserved 
motifs to predict co-regulated genes. An alternative to our Dirichlet process prior distribution 
is presented that differs substantially in terms of a priori clustering characteristics, but shows 
no substantive difference in the clustering results for our dataset. Despite our specific applica- 
tion to transcription factor binding motifs, our Bayesian clustering model based on the Dirich- 
let process has several advantages over traditional clustering methods that could make our 
procedure appropriate and useful for many clustering applications. Software for our method is 
available a rjhttp : //stat . wharton . upenn . edu/^st j ens en/ research/ cluster . html| 

Keywords: Motif clustering, Bayesian hierarchical modeling, Dirichlet process, Gibbs sam- 
pling 



1 Introduction 



The complete information that defines the characteristics of living cells within an organism is en- 
coded in the form of a moderately simple molecule, deoxyribonucleic acid, or DNA. The building 
blocks of DNA are four nucleotides, abbreviated by their attached organic bases as A, C, G, and 
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T. A-T and C-G are complementary bases between which hydrogen bonds can form. A DNA 
molecule consists of two long chains of nucleotides that are complimentary to each other and 
joined by hydrogen bonds twisted into a double helix. This structure gives rise to the term base 
pair when describing a DNA sequence. A long DNA chain in a living cell is called a chromosome. 
For example, every human cell has 23 pairs of chromosomes with lengths ranging from 47 million 
to 245 million base pairs (Mb). The specific ordering of the four types of nucleotides along these 
chains is the means by which the information is stored that completely defines all functions within 
a cell. The term gene refers to sequence segments along a chromosome that are used to code the 
information for making proteins, the fundamental action molecules of the cell. Surprisingly, only 
about 1 to 2 % of the entire human genome (set of chromosomes) corresponds to gene regions. It 
is believed that much of the mechanism for controlling when, where, and how much protein will 
be produced is located in the "non-coding" region located upstream (ie. directly before) the gene 
sequence. 

Transcribing or activating a gene requires not only the DNA sequence in the upstream region, 
but also many proteins called transcription factors (TF). When these TFs are present, they bind to 
specific DNA patterns in the upstream sequence of genes, and either induce or repress the tran- 
scription of these genes by recruiting other necessary proteins (|Lodish et all Il995h . A particular 
transcription factor protein is able to bind and regulate only certain target genes by recognizing 
a short (6-20 basepairs long) sequence of nucleotides called a transcription factor binding site (or, 
more simply, a site). Different binding sites (located near different genes) of the same transcription 
factor protein show a substantial sequence conservation, which we call a motif, but some variabil- 
ity is also present. 

1.1 Statistical Formulation of Motifs 

Each motif is mathematically formulated as a motif matrix, which measures the desirability of each 
base at each position of the motif. The simplest matrix is an alignment or count matrix Nj^, which 
records the occurrence of base k at position j of all the sites for this motif. Table^shows the count 
matrix for motif MA0011 from the database we will use in Section|3| Also shown in Table 0is the 
corresponding frequency matrix (fj^ = Nj^/N, where N is the number of motif sites) for motif 
MA0011. 

Table 1: Matrix representations of the motif MA0011 
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Schneider and St ephens (1990) used the motif matrix to construct a Sequence Logo as a means by 
which to visualize the appearance of the motif. FigureU g ives the sequ ence logo for the same motif 
MAO 1 1, as constructed by the program WebLogo (|Crooks et fl/.ll2004|) . The height of each position 
is equal to its information content Q^ fc fjk log[/jfc/#ofc] where #ofc is the proportion of base k in the 
non-motif background positions) and the size of each letter is proportional to the letter's relative 
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frequency f jk . 

Figure 1: Sequence logo of the motif MA0011 
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1.2 Discovery of Motifs 



As reviewed in Tense n et al. (2004), Bayesian motif discovery models are the foundation of many 



popular programs for discovering conserved binding sites in large sequence datas ets. These mod 



els a re usually implemente d by an iterative strategy, such as the EM algorithm ^Dempster et al 



Il977h or the Gibbs sampler (Ge man and Gem an. 1984), that utilizes the ease with which the motif 
frequency matrix can be estimated if the binding site locations are known, and the correspond- 
ing ease with wh ich the binding site locations can be estimated if the motif appearance is known. 
Jensen et al. ( 2004) also discuss several constraints of these motif discovery procedures, such as 
assumptions of known motif width (the number of colum ns in the motif frequency matrix) and 



known abundance of binding sites. Te nsen and Liul {2QQ4J) demonstrates that allowing the motif 



width to vary leads to more accurate results in several motif discovery applications. 

Although the discovery and characteriz ation of a single motif is often the goal of a particular bi- 
ological investigation (see, for example, Eichenberger et al. 1 2003)), it is common for scientists to 
be interested in examining the similarities and differences between an entire collection of discov- 
ered TF motifs. Large collections of discovered motifs have bee n utilized i n various applications, 
such as the phylogenetic discovery of co-regulated genes ( Oin et all 2003h and the p rediction of 



synergistic relationships between transcription factors jHannenhalli and Levyl 2002h . These ap- 
plications each represent a highly specialized approach to the utilization of a collection of motifs 
and do not address the issue of a general statistical approach for the sharing of information be- 
tween motifs. 



1.3 Modeling Motif Similarity by Clustering 

Figure E| shows the sequence logos for four different motifs from the JASPAR database which will 
be analyzed in Section|3| It is clear that these motifs all show differences in width and appearance, 
but there exists some similarity or common structure within this set. One could certainly argue for 
grouping MA0031 and MA0032 together based upon their similar appearances, though they do 
differ in both width and composition , but this decision is based on ad-hoc personal judgement. 
The statistical problem of interest here is to model the common structure between these different 
motifs and find a principled means by which to group motifs together based upon their similarity. 
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Figure 2: Four different motifs from the JASPAR database 



MA0011 MA0015 




There are se veral trad i tional statistical techniques for clustering observations together which are 
reviewed in Hartigan (1975). Hierarchical Tree Clustering joins observations together into succes- 



sively larger clusters based upon some sort of similarity measure. K-means Clustering groups 
observations into a pre-determined number of clusters by minimizing a within-cluster distance 
measure. Each of these techniques have elements that are not ideally suited for our desired goal 
of motif clustering. Hierarchical tree clustering requires the user to specify a distance measure 
between the observations (in this case, motif matrices), and it is not clear for comparing motifs 
what type of simple distance metric should be used. In addition, the result of this algorithm is a 
tree that joins all observations toge ther, and it is no t clear where the tree should be "cut" in or- 
der to produce a set of clusters. Kielbas a et al. $2005) proposed two different distance metrics for 
the clustering of motif matrices, but needed to impose arbitrary thresholds on those distances in 
order to produce a set of clusters. Similarly, Sch ones et al\ (|2005) present several distance metrics 
for comparing motif matrices, and group together pairs of matrices below a p-value-based thresh- 
old. K-means clustering is ideally suited for situations where the number of clusters is known a 
priori. When the number of clusters is unknown, K-means clustering becomes more difficult, and 
usually a cross-validation strategy is employed to estimate the number of clusters. For our motif 
clustering applications, there is very little prior idea of how many motifs might cluster together in 
a particular collection of motifs and so we seek a model that easily allows for an unknown number 
of clusters. 

In addition, these techniques consider the observations themselves as fixed and known, which is 
not the case for our applications where each motif is only an estimate generated by a prior motif 
discovery procedure. Recognizing that our discovered motifs themselves are estimated quantities, 
we need to model both within-motif and between-motif variability. In Section |2j we outline a 
Bayesian hierarchical clustering model that encompasses both levels of variability and does not 
require prior knowledge of the number of clusters. As discussed briefly in Section lL2l most motif 
discovery procedures assume a fixed and known motif width, but in reality the width is often 
unknown for many motifs and can vary substantially between different motifs. In Section l2~5l we 
extend our Bayesian motif clustering model to allow each motif width to be an unknown variable 



4 



that will be estimated by our procedure. Our model is implemented stochastically by a Gibbs 
sampling algorithm, which allows us to examine not only "best estimates" of motif clusters, but 
also the variability within our clustering results. In Section [3J we present various techniques for 
summarizing and understanding the results from our clustering procedure, within the context of 
an application to the JASPAR database of transcription factor binding motifs. 

These clusters provide a way to organize transcription factor proteins based on the similarity of 
their binding motifs. We utilize this clusterin g to analyze la rge databases of tra nscription factor 
matrices, TRANSFAC jWingender et aZlbOOtfr and JASPAR jSandelin et al\\ xm\ for motifs with 
high similarity (Section |4j and discuss how this clustering information can be used to refine the 
search for additional transcription factor binding sites. In addition, we briefly discuss combining 
our clustering procedure with motif discovery in sequences that have been conserved by evolu- 
tion, in order to predict genes that share similar transcription factor binding motifs and thus are 
possibly co-regulated (Section^. 



2 Bayesian Motif Clustering Model 



We use a Bayesian hierarchical model to infer common structure, in the form of clusters, within a 
collection of discovered motifs. The data for each discovered motif is a count matrix Nj which can 
have different widths and number of counts (ie. number of binding sites included in the matrix) 
compared to other TF motifs. For now, we assume that our clustering will be based on a motif 
matrices with a fixed and known width w, so we assume each of these n raw motif matrices N, 
should contain a submatrix Yj of dimension w x 4 that will be considered the core upon which 
the clustering will be based. We will later extend our model to allow the core width within each 
cluster of motifs to vary. 



2.1 Hierarchical Framework 



Hierarchical models are useful in a variety of scientific problems when the structure of the data 
suggests multiple levels of uncertainty. We want to include components for both within-motif 
and between-motif variability of the nucleotide counts where i indexes the motif, j indexes 
the w columns within each motif core, and k indexes the four possible nucleotides (k = a, c, g, or 
t) within each column. Our model on the within-motif variability between different binding sites 
for a count motif Yj is a product-multinomial model. We assume that each position (column) of 
the core count matrix Yj follows an independent multinomial distribution parameterized by the 
corresponding column of an unknown frequency matrix 0j, ie., 

w 

Within-motif level: Count matrices p(Y,|0j) = f\ PO^ijWij) 

j i 

Yy = (Y ija ,Y ijc ,Y ijg ,Yij t ) ~ Multinomial (nj,^- = (0y O , %c, 0yt)) 

For our between-motif variability, we simply assume that each motif frequency matrix 0j in our 
collection share a common but completely unknown distribution, denoted F(-), ie. 
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Between-motif level: Frequency matrices p(&i) 



where F(-) is an unknown distribution with w dimensions for the columns x 4 dimensions for the 
nucleotides (constrained to sum to one). This unknown distribution F(-) represents the common 
structure between the different motifs in the dataset. Estimation of this unknown distribution is 
complicated by the fact that our frequency matrices 0j are unknown, with only the count matri- 
ces Yj being observed. A popular Bayesian approach to non-parametric problems is to give the 
unknown distribution F(-) a Dirichlet p rocess prior, T>(j), where 7 is a finite (non-negative) mea- 
sure, typically smooth ( Fergusorl Il974h . Here, since we have a multidimensional F(-), we use a 
Dirichlet process prior 2? (71 X • • • x 7^), where each smooth measure jj is four dimensional, taking 
the form of jj = b x Dirichlet(a, . . . , a) for j = 1, . . . ,w. The parameter b is a weighting factor that 
characterizes how close the unknown distribution F is to the shape of 7 and the smoothness of F. 



2.2 Clustering of Observations 

An important consequence of our model is that it enables similar motifs to b e clustered together 



An important consequence ot our model is mat it enables similar motits to be clustered togetner 
into a group modeled by one common frequency matrix. As explained in Fereuson (Il974h . if 
©i, ... , 0„ are n i.i.d. observations from the probability function F whose prior distribution is 
the Dirichlet process ^(7), where 7 is a finite measure on the domain, then 

n 

F(-)|0i, • ■ ■ , ©„ ~ 2%*) = V( 1 + Y J S &j ) 

3=1 

Thus, the posterior mean of F(-), or the predictive distribution of a new observation, is propor- 

n 

tional to 7 + Yl ^® • W the 0's only take on C distinctive values, then we have a mixture of the 

smooth measure 7 and C point masses (with potentially different weights). These point mass 
components allow for the clustering of similar observations. If we were to draw an additional 
(n + l)-th observation 0* from this distribution D(7*), that new observation would either come 
from the smooth measure 7, or would take on a value exactly equal to one of the current 0/s, say 
C , in which case C and 0* are defined as being in the same cluster. The conditional distribu- 
tion p(©i|0_j) of one current observation 0j, given all other observations i, is also a mixture 
between the smooth measure and C point masses at each of the 0_j that represent the unique 
values within 0_^. Any observations m and 0„ that have the same value are defined as being 
in the sa me cluster. This conditiona l distribution allows us to implement our model via Gibbs 
sampling jGeman and Gemanl.ll984|) . The Di richlet process has b een used as a prior distribution 
in nonparametric Bayesian a nalyses, su ch as iMacEachern 1.1994) and lEscobaii ll9 94^ for the esti- 
mation of normal means and Liu (1996) in a binomial hierarchical settingT lGreen and Richardsonl 
(2001) discuss the use of the Dirichlet process as a flexible model for clustering observations, and 
present an ex tended class of Dirichlet-Multinomial allocations for which the Dirichlet process is a 
limiting case. M edvedovic and Sivaganesa 3 E002) uses the clustering properties of the Dirichlet 



process prior as part of a hierarchical model for gene expression profiles from microarray data. 
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2.3 Gibbs Sampling Implementation 



For our motif clustering model, a Gibbs sampler could intuitively be based on p(0j|0_$). How- 
ever, since our 0/s are actually unknown, a mor e efficient clustering procedure involves drawing 
values of the clustering indicators directly, as in iMacEacherrrl jl994h , without dealing with draw- 
ing a frequency matrix 0/ for each motif i at each iteration. We denote our clustering indicators 
z = (zi, . . . z n ), which simply defines a partition of {1, ... , n}. Algorithmically we let z, = c if 
&i belongs to the c-th cluster or z% = if 0j is d rawn from the prior measure 7, and hence forms 
a new cluster. Our "collapsed" Gibbs sampler l|LiuL 1199*3) iteratively samples from p(zj|z_j, Y) 
where we again use the notation z_i or 0_ j to mean all the z or parameters except the i-th one. 
As mentioned earlier, no matter what the current 0_j is, as long as they correspond to the same 
indicator vector z_j, we have the same (almost surely) conditional prior distribution p(zj|0_j). 
Thus, we can write that 

b ti 

p(Zi\@-i) = p(Zi = 0|z_j) = — p(Zi = c|z_j) = —— - — - (1) 

o + n— 1 b + n — 1 

where n c is the size of cluster c (ie. the number of z's in z_j which are equal to c) and b is the weight 
parameter for forming a new cluster. Thus, under this model the prior probability for joining a 
particular cluster increases as the number of observations in that cluster increases, implying that 
the Dirichlet process prior favors unequal allocations of observations. With observations Y, we 
have the posterior conditional distribution 

p(zj|z_i,Y) ocp(z,Y) ocp(Y I z)p(zi I z_j). 

It is evident from our model that p(Y | z) is straightforward to derive and can be written as the 
product of the normalizing constants of C product multinomial distributions, where C is the total 
number of distinctive zs. After some simplifications, we have the probability that observation Yi 
forms a new cluster is 

p(*.-0|. ■ Y) oc b TT n * F(y ^ + a) r ^ (2) 
p(z 4 -0|z_,,Y) oc b + n _ 1 ll r{EkYijk + 4a)T{ar , (2) 

where b is the weighting factor as defined at the end of Section |2~T1 For the case where Zj = c ^ 
(i.e., joining an existing cluster that already has a count matrix Y c ), we have 

b + " - 1 ji r(E* y,,t + Y ljt + 4a) Uk r(Ycjt + «) 

A complete iteration of our Gibbs sampling algorithm results in a complete sample z of our clus- 
tering indicators, which also represents a complete partition of our motif matrices. 



2.4 Motif Alignment 

An additional component of our model addresses the fact that we do not necessarily know which 
core Y{ of width w to use within the raw alignment matrix of width n» > w for motif i. We use the 
notation = j to mean that we are using the columns j, j : + 1, j ; + w — 1 of our raw motif matrix Nj 
as our core Yj. For example, if our clustering algorithm is based on a fixed width of w = 6 and our 
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i-th raw motif matrix Nj has 8 positions, then we have three possible choices for our core motif: 
cij = 1 (Yj = columns 1 to 6 of Nj), a\ = 2 (Yj = columns 2 to 7 of Nj), or a, L = 3 (Yj = columns 3 
to 8 of Nj). We thus need an additional step where, for each raw data matrix, the best location of 
the central motif Oj is drawn conditional the other motifs Y_j and clustering indicators z_j for the 
other motifs. Let Y^ denote the core Yj that corresponds to the choice of a particular aj, then the 
posterior probability p(a{) = p(ai |z_j, Y_j) of aj is 



p( ai ) oc / p(Y l ai |0 l )j3(0 l |z_ l ,Y_ l ) d& l 



b + n - 1 
T(4a 



r(4a) 



b + n-1 



Fa 4 



Ta 4 



c 

E 



p(Y c , © c |z_,) 



d0, 



n^vg, + fe* + ^ r(E fc y CJfc + 4a) (4) 



r(E, y& + 4 «) ^ 6 + « - 1 f = \ r(E, y# + + 4a) n fc r(v^ + . 



This alignment procedure is performed every tenth iteration of the collapsed Gibbs sampler de- 
scribed in the previous section. 



2.5 Allowing Cluster Width to Vary 

We now extend our model to allow the core motif widths within each cluster, w = {w\, . . . , wc), 
where C is the current number of clusters, to be unknown variables. Each cluster width w c is 
modeled as being independent with prior distribution w c ~ Poisson(A) where A is the expected a 
priori width of the motif in each cluster, which we assume is fixed and specified. We let be the 
"background" counts of nucleotide k over all columns of the raw matrix Nj that are not included in 
the core matrix Yj, which now must be taken into account by our model since each motif width is 
allowed to vary. These background columns represent the edges of the raw motif matrices that are 
not well conserved. We assume that the background counts, Bj = (Bi a , . . . Ba), are a multinomial 
realization from an underlying vector of background nucleotide frequencies #o = (#0a> • • • > #ot)- 
With these added distributions, the posterior probability for the core width of a particular cluster 
c, conditional on the current members of that cluster z c and their core alignments a c , is 

v(w Iz aloe ft n»r(iU + a) r(4a) -pr^ \ Wc e~ x 

where Y C jk and B c j, are, respectively the core and background nucleotide counts in cluster c. We 
implement this added component of our model by an additional step in our Gibbs sampling algo- 
rithm that, for each cluster c, samples a new valu e of w c from the conditional posterior distribution 
0. Some methods, such as Schones et al. |2005), use a fixed number of columns to calculate the 
similarity between matrices and these core widths are then held fixed during their clustering pro- 
cedure. In contrast, our Gibbs sampling implementation allows the core width of each motif to 
change depending on the current set of clusters. This means that each core width in our model is 
being estimated using additional information from other motifs in the dataset, rather than inde- 
pendently estima ting each core width using only information from each matrix separately. Other 
methods, such as Kielba sa et al\ l|2005h . cluster motifs on the basis of a pairwise distance and also 
estimate core width based on a pairwise comparison. This is a slight improvement compared to 
treating each matrix individually, but still does not use as much information as our model, which 
makes width decisions on the level of an entire cluster, not just for pairs of motifs. 



8 



2.6 Alternative Clustering Priors 



As mentioned in Section lZ3l the Dirichlet process prior favors unequal allocation of observations, 
meaning that each new observation has a greater prior probability of being placed in a cluster 
that already has many observations. An alternative is a uniform clustering prior which favors 
equal allocations of observations i.e., the prior probability that a new observation is placed in any 
one of the existing clusters is uniform. If we already have n observations divided into clusters 
c = 1, . . . , C with ni,...,n c members, then the two prior distributions are 



Dirichlet process prior 

P(z n+1 = c|z,DP) = ^ 
P(z„ +1 = new|z,DP) = ^ 



uniform clustering prior 

P{z n+ i = c|z,unif) = 
P{z n+ i = new|z,unif) = ^ 



where b is the weight given to forming a new cluster. In fact, we can consider both the Dirichlet 
process and uniform clustering specifications as particular cases of a more general clustering prior 
distribution, where 

P(z n+ i = new|z) oc b P{z n +i = c|z) oc f(n c ) c=l,...,C (6) 

This general clustering model reduces to the Dirichlet process when f{n c ) = n c and the uniform 
clustering prior when /(n c ) = 1, but more general functions may be desirable in particular sit- 
uations. The prior density of a partition z with C clusters under either our Dirichlet process or 
uniform clustering model can be calculated recursively. With the Dirichlet process prior, we have 
prior density 

b c ■ nOc-i)! 

p(z|DP) = (7) 

U(P + i-l) 

With the uniform clustering prior, we have prior density 

„.^^±£> (8) 
n (p + c) n - 

c=l 



It is worth noting that the Dirichlet process prior density is not affected by the ordering of 
the clusters, whereas the uniform clustering prior density © is. In other words, different parti- 
tions with the same cluster sizes are exchangeable under the Dirichlet process model, but we will 
get different values of the uniform clusteri ng density for different , but e xchangeable orderings of 
unequally-sized clusters. As suggested by lGreen and Richardsonl ||_C)01), to ensure exchangeabil- 
ity of our uniform clustering model, we need to make our prior density p(z|Unif) a function of 
a "signature" of the partition that is identical for exchangeable partitions. For example, if we let 
p*(z|Unif) = k ■ p(z'jUnif) where z' is z with the z/s arranged in order from the largest cluster to 
the smallest, then the calculation of © for z' will be the same for all exchangeable values of z. All 
of these complications are avoided in the Dirichlet process model which automatically gives the 
same prior density value for exchangeable partitions. 
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We compared the behavior of these two clustering prior specifications for a simple simulation 
study, where 1000 complete partitions z = (z%, . . . ,z n ) with n = 106 and 6=1 were generated 
under both prior distributions. This particular sample size n was chosen to be comparable with 
the real data in our first application in Section|3]below. Figure |3] displays the distributions of both 
the number of clusters as well as the size of the multiple-member (n c > 1) clusters over all of 
our simulated partitions. As expected, the number of clusters (with multiple members) is larger 
under the uniform prior and the size of some clusters from the Dirichlet process are larger than 
any generated from the uniform prior specification. 

Figure 3: Comparison of clustering statistics between DP and uniform clustering priors 



Number of Clusters - DP Size of Clusters - DP 




We now discuss several applications of our Bayesian motif clustering model to collections of tran - 
scription factor matrices. In Section |3j we apply our method to JASPAR (Sandeli n et all |2004), 
a small but heavily curated database of transcription factor matrices. In Section |4j we apply 
our method to a collect ion consisting of all matrices from JASPAR and the TRANSFAC database 
( Wingen der et all l2000i) . which is a larger but less curated database of transcription factor matri- 
ces. Finally, in Section|5j we discuss a cross-species application where discovered motifs are used 
to infer co-regulated genes in bacteria. 

3 Application to a single database: JASPAR 

We use the JASPAR llSanderinef fl/.l. f2004) database as an example for illustrating different strate- 
gies for visualizing and analyzing results from our clustering model. This database contains 111 
nucleotide-count matrices which differ substantially in appearance, number of counts, and motif 
width. Between different motifs, the number of binding sites used to construct the motif ma- 
trix (the total number of multinomial counts) varies from 6 to 389, with an average of around 35 
counts per matrix. The range of matrix widths was from 4 to 30 bp, though the matrices were 
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generally short, with an average width of approximately 11 bps. From this database, we also have 
species information for almost all motifs, as well as a classification into a particular "protein fam- 
ily" based on the common physical structure of each motif's DNA-binding domains. For example, 
one family of transcription factors is the helix-loop-helix family, which has two DNA-binding he- 
lix domains that bind directly to the DNA strand and are joined by a loop domain. We will use 
this extra protein family and species information when we examine the results produced by our 
clustering model. 

For our Dirichlet process prior, we chose prior parameters a and prior weight b to both be equal 
to 1. We chose a small prior weight b for simplicity, since we had no prior knowledge on the 
number of clusters to expect, and our clustering results did not change substantially when using 
larger values of b. We also assumed our background frequencies were uniform across nucleotides, 
ie. 9ok = 0.25 for k = 1, . . . , 4. Although we allowed the motif widths to vary, we restricted the 
core width of each motif to be at least 6 bps in order to reflect biological reality, with an a priori 
expected core width A of 8 bps. This restriction reduced our dataset from 111 matrices down to 
106 matrices. As described in Sections 12.312.51 our Bayesian hierarchical clustering model was 
implemented using a Gibbs sampling algorithm. Details of our evaluation of convergence are 
given in the supplemental materials. 

3.1 Tree of Pairwise Clustering Probabilities 

An intuitive means for examining our overall clustering results is the posterior probability pij that 
a particular pair of motifs i and j are in the same cluster. The value of p^ for any two motifs i and 
j can be estimated by the proportion of iterations that have motif i and j in the same cluster, which 
is the Monte Carlo estimate of the posterior mean of the indicator variable for motif i and j being 
in the same cluster. The full list of pairwise clustering probabilities is given in the supplemental 
materials. Based on these pairwise clustering probabilities p^, a pa irwise distance measure can be 
calcu lated between each pair of motifs in the dataset, dij = 1 — . Me dvedovic and Siva ganesan 




(2002) use the same distance measure in their gene expression application. Our distance matrix 
was then converted into a tree diagram by an average-linkage hierarchical algorithm, which is 
shown in Figure |U 

This clustering tree shows several strong relationships, such as the group of NUCLEAR motifs 
and the group of bHLH motifs on the lefthand side of Figure |4j Many weaker relationships are 
also present, implying that many motifs have a low but non-zero probability of being grouped 
together. Although many of the stronger clusters of motifs belong to the same protein family 
and same species, there are several interesting exceptions. The NUCLEAR and bHLH groups 
mentioned above contain mostly motifs from Homo sapiens, but the NUCLEAR group also contains 
a motif from fruit fly Drosophila melanogaster and the bHLH group includes motifs from mouse 
(Mns musculus). 

3.2 Best Clustering Partition and Cluster Strength 

Although Figure H| allows us to examine the clustering structure of the entire dataset, the tree is 
not ideal for deducing the "best partition" or best set of clusters in the dataset, since these pairwise 
probabilities are calculated across many different partitions, similar to the problem for hierarchical 
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tree clustering mentioned in Section IL3I One could "cut the tree" at any number of different 
threshold distances and thereby produce any number of possible partitions, but a less arbitrary 
alternative is to estimate of the posterior mode of our clusters from our MCMC simulation. We 
estimate this posterior mode by calculating the posterior probability of the partition z at the end of 
each iteration of our sampler, and retaining the partition z with the highest posterior probability 
as our best estimate of the mode. The probability p(z|Y, B) of z is calculated as the product of the 
likelihood value of our cluster matrices Y c (the sum of all u> c x 4 count matrices in cluster c) and 
total background counts B^, 



p(Y,B|z) 
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nn 
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and the prior densities of our clusters p(z) and variable motif widths p(w). Although this best par- 
tition can reduce our dataset down to a list of interesting clusters, we have lost information about 
the variability of these clusters by focusing on a point estimate. In order to retain some measure of 
variability, we incorporate cluster-level and observation-level clustering characteristics within our 
"best cluster s". We can measure th e strength of each cluster by calculating the logarithm of the 
Bayes factor (|Kass and Rafteryl Il995h for the current cluster c, with members z = (zi, Z2, ■ ■ ■ , z nc ), 
versus each member of the cluster forming its own cluster, 



Strength(Cluster c) = log 



P(Y|z all same) 



P(z all same) 



P(Y\z all different) P(z all different) 



For a cluster of motifs (Yi , . . . , Y m ) and clustering indicators z 
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where Y and again denote the count and frequency matrices for the entire cluster together. The 
clusters within our best partition can then be ranked by this measure of cluster strength, giving us 
an extra measure of confidence/ uncertainty about inference based upon a specific cluster. We can 
also measure clustering strength at the level of individual motifs within our best partition by cal- 
culating, for each motif, the posterior probability that it should belong to that cluster, as opposed 
to any of the other existing clusters or being its own cluster. For each motif i, this posterior prob- 
ability p(zi\z-i, Y) is the same calculation that is performed during each iteration of our Gibbs 
sampling algorithm, but in this case we are conditioning on the best partition i.e., p(£j|z_j, Y). 

For our dataset, the partition z with the highest posterior value consisted of 26 multiple-member 
clusters containing 69 out of 106 total motifs. The strongest 6 of these 26 clusters (using our clus- 
ter strength measure) are listed in Table |2| along with their cluster size, width and consensus 
sequence. The consensus sequence is a representation of the total count matrix for the cluster, 
giving the nucleotide with the highest count in each position. A nucleotide is only capitalized 
if its nucleotide frequency is greater than 0.75 in that position. It is clear from the table that this 
measure of cluster strength is quite dependent upon the size of the cluster: larger clusters tend to 
have a higher value of cluster strength. 

Most of our strongest clusters contain motifs from within a single TF protein family, though there 
are exceptions, such as the fifth cluster that contains both ZN-FINGER and FORKHEAD motifs. 
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Table 2: Strongest Clusters from the Best Partition of JASPAR matrices 



Clus 


Size 


Strength 


Width 


Consensus 


Protein Families 


Species 


Motifs 


1 


6 


186.5 


6 


aGGTCA 


NUCLEAR 


H. sapiens 

D. melanogaster 


MA0065 MA0066 
MA0071 MA0072 
MA0074 
MA0016 


2 


5 


179.5 


6 


CACGTG 


bHLH-ZIP 
bHLH 


H. sapiens 

M. musculus 
M. musculus 


MA0058 MA0059 
MA0093 
MA0104 
MA0004 


3 


3 


82.2 


7 


aTGACGT 


bZIP 


A. majus 
H. sapiens 


MA0096 MA0097 
MA0018 


4 


3 


72.7 


7 


CCGGAAg 


ETS 


H. sapiens 

D. melanogaster 


MA0028 MA0076 
MA0026 


5 


3 


72.5 


8 


GTAAACAa 


FORKHEAD 
ZN-FINGER 


H. sapiens 

D. melanogaster 


MA0030 MA0031 
MA0013 


6 


2 


40.0 


8 


cAATtATT 


HOMEO-ZIP 


A. thaliana 
A. thaliana 


MA0008 
MA0110 



Clearly, this cluster would not have been detected if motifs were only grouped together based on 
TF family. It is also interesting to note that most of the larger clusters contain similar motifs from 
different species, with motifs from human (H. sapiens) being grouped with motifs from mouse 
(M.musculus), fruit fly (D. melanogaster) and snapdragon flowers (Antirrhinum majus). All of the 
motifs in Table |2]have individual clustering probabilities close to 1, but many of the weaker clus- 
ters in the best partition contain motifs which have individual probabilities substantially less than 
one. Our entire best partition of JASPAR matrices is given in the supplemental materials. 



3.3 Core Width Variability in JASPAR 

A key component of our clustering procedure is that the width of the core motif within each raw 
matrix is not considered to be fixed and known, but is instead allowed to vary by cluster in our 
model. We examine the variability of these core widths in Figure|5j which is a plot of 95% posterior 
intervals for the core width of each motif in our JASPAR dataset. 

The 95% posterior intervals are quite different between motifs, with some motifs having a wide 
interval while other motifs have an interval consisting only of the minimum motif width of 6 bps, 
which in many cases is because the raw data matrix is only 6 bps wide. In several motifs, the 95% 
posterior interval for the motif width does not even include the a priori expected motif width of 
8 bps. The wide intervals for seve ral motifs suggests that c onsidering core widths as fixed and 
known, as in Cartharius et al. (2005) and Schones et al. (2005), would result in a substantial loss of 
information 
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3.4 Effect of Prior Specification on JASPAR results 



In Section 1731 we discussed the a priori differences between the Dirichlet process prior compared 
to the uniform clustering prior. We now investigate whether these differences are also apparent in 
the posterior JASPAR results. The distribution (over all partitions produced by the Gibbs sampler) 
of the number of multiple-member clusters and the average size of our JASPAR clusters is given 
in Figure |6| 

The Dirichlet process prior and uniform prior give dramatically different clustering results based 
upon prior simulation alone (Figure |3j , but show very slight differences in the posterior cluster- 
ing results (Figure |6j|. Only minor differences were observed between our Dirichlet process and 
uniform clustering models in terms of the clustering trees (Section l3.1ll and best partitions (Sec- 
tion 13.2b . which indicates that our choice of a Dirichlet process prior distribution was not very 
influential on our posterior clustering results, at least in comparison to a uniform clustering prior 
alternative. However, other datasets may show a larger influence of the prior specification on 
the posterior clustering results, and our sof tware allows the user to specify the use of either prior 
distribution. Green and Richardsonl (2001) demonstrate with several datasets that the unequal 



allocations favored by the Dirichlet process priors can persist in the posterior distribution. 



4 Application to Combined Databases: JASPAR and TRANSFAC 

Our motivations for applying a clustering model to established databases, such as JASPAR or 
TRANSFAC, are to eliminate "redundant" matrices within these databases and to understand the 
similarities as well as differences among the different transcription factors. If two (or more) mo- 
tifs have nearly identical core matrices, we can consider them as redundant since they do not 
contain unique information in terms of appearance. The presence of redundant matrices com- 
plicates the use of these databases for further motif discovery, since searc hes involving severa l 
nearly identical matrices will lead to an excessive number of predicted sites (Kielbas a et all 12005). 
and it also affects how we evaluate computational results. As mentioned in Sections I2.4W2.51 an 
additional complication in these databases is that the motif core has unknown width and location 
within each matrix, which is also addressed by our clustering model. In Section|3j we already ob- 
served substantial redundancy within the JASPAR database alone. Now, we want to comb ine our 
JASPAR matrices together with the larger TRANSFAC database I Wingend er et all l2000h . which 
contains 714 transcription factor matrices. In this combined collection of 825 matrices, we ex- 
pect substantial redundancy which we will address with our clustering procedure. Similar to our 
JASPAR-only application, we chose prior parameters a and b to both be equal to 1, expected core 
width A = 8 bps, and background frequencies 6>ofc = 0.25 for k = 1, . . . , 4. We again restricted the 
core width of each motif to be at least 6 bps, which reduced our dataset from 825 matrices down 
to 817 matrices. Evaluation of Gibbs sampling convergence for this combined dataset are given in 
the supplemental materials. 

The best partition produced by our Gibbs sampling implementation consisted of 165 multiple- 
member clusters containing 746 out of the 817 matrices in our combined collection. The clusters 
in our best partition are given in the supplemental materials, along with a full list of the pairwise 
clustering probabilities. We also examined this combined dataset using our model with the al- 
ternative uniform clustering prior, and observed very little difference in the posterior clustering 
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results (details given in supplemental materials). Our expectations of high levels of redundancy 
in these databases seems confirmed by the fact that approximately 90% of the matrices in our 
combined collection are partitioned into multiple-member clusters. The strongest 10 of these 165 
clusters (using our cluster strength measure) are listed in Table |3j along with their cluster size, 
width and consensus sequence (as defined in our previous application). For brevity, the individ- 
ual members of each cluster are not given in Table |3j Instead, the total number of members from 
either JASPAR or TRANSFAC in each cluster is given. 

Table 3: Strongest Clusters from the Best Partition of TRANSFAC and JASPAR matrices 



Cluster 


Size 


Strength 


Width 


Consensus 


Number of Members 
JASPAR TRANSFAC 


1 


24 


1015.3 


6 


CACGTG 


1 


23 


2 


25 


974.9 


6 


TGACGT 


2 


23 


3 


15 


697.5 


8 


TTTcGCGC 


1 


14 


4 


18 


550.1 


6 


aGATAa 


1 


17 


5 


16 


548.4 


6 


CACGTG 


4 


12 


6 


17 


486.5 


6 


cGGAAg 


3 


14 


7 


7 


358.6 


6 


TGTTCT 





7 


8 


11 


338.7 


7 


tcACGTG 





11 


9 


14 


336.7 


6 


TGAcCt 


1 


13 


10 


11 


319.3 


6 


AAAGcg 


4 


7 



From Table |3j we can see that there is substantial redundancy both between databases and within 
each database. The redundancy seems to be more substantial within the TRANSFAC database, 
even taking into account the fact that the JASPAR matrices are a small component (13%) of the 
combined collection. This is not surprising, since there was a substantial amount of manual cura- 
tion involved in the creation of the JASPAR database. We also attempted additional merge moves 
to try and combine our strong clusters with similar appearance but these moves did not actually 
increase our posterior density. Just as in our smaller JASPAR application (Section 0, we again 
observed that a substantial number of motifs in our combined dataset also had substantial vari- 
ability in their core widths and alignments. 87% of the motifs in the combined dataset had 95% 
posterior intervals for the core width which covered more than a single fixed value (details given 
in the supplemental materials). It is also worth noting that a majority of the remaining 13% of 
motifs had no variability simply due to the fact that the raw motif width was equal to the min- 
imum co re width. Clearly, thes e res ults suggest that th e assumption of fixed and known motif 
widths in Cartharius et al\ |2~005) and Schon es et al\ |2005h is a tenuous one, and results in the loss 



of substantial information. 



Several existing motif-finding programs, such as Matlnspector ( Carthari us et al. , 2005|) . attempt to 



utilize non-redundant collections of matrices for scanning large sets of genomic sequences for tran- 
scription factor binding sites. Matlnspector uses a collection of matrix clusters created by manual 
cura tion, whereas we o bviously prefer an automated method for generating clusters of matri- 
ces. ISchones et al\ |2005) present a clustering method for motif matrices which allows matrices to 
belong to multiple clusters, which means their method does not produce a partition of matrices 
which can be used to reduce redundancy. Our model implementation produces a best partition 
of clusters, alignments and core widths, all of which are needed to create a super-matrix for each 
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cluster, which is the sum of the aligned core matrices in that cluster. The set of super- matrices 
can then be used as inputs for a sequence scanning algorithm, such as lHuang et al\ j200# . When 
looking for matches within genomic sequences, using these super-matrices reduces the number 
of comparisons that are needed compared to using each matrix in a database individually. This 
reduction of redundant comparisons not only eases the computational burden of sequence scan- 
ning, but also helps somewhat with the usual problem of evaluating the statistica l significance of 
good matches in this multiple comparison setting, as mentioned in lKielbasa et al. j2005h . 



5 Application to Cross-Species Conservation 



An additional application of our motif clustering procedure is presented in llensen et al. 12005), 



where phi/logene tically-discovered m otifs are clustered to infer co-regulated genes. Phylogenetic 
motif discovery jlVIcCue et all 2001 ) searches for conserved motifs in upstream sequences from dif- 



ferent, but related, species under the assumption that transcription factor binding motifs are likely 
to be conserved by evolution. If the motifs found upstream of several Bacillus subtilis genes are 
similar enough to be clustered together, then it is possible that the same TF (recognizing that com- 
mon motif) is targeting each of the genes in that cluster. Thus, by combining statistical techniques 
for both motif d i scove ry and motif clustering, one can infer clusters of potentially co-regulated 
genes, bin et all ^2003) inferred co-regulated genes by applying a clusterin g algorithm to previ- 
ously discovered motifs in E.coli. Instead of using a Dirichlet process prior. lOin et al\ j2003h used 
the same uniform clustering prior presented in Section l2~6l However, their clustering techinique 
was simpler than our model in that the core motif width was considered to be fixed and known, 
which we suggest in Sections |3] and H| is quite a restrictive assumption. 



In lTensen et a l. ( 2005), we use phylogenetic motif discovery on genomic sequences from the bac- 
teria Bacillus subtilis and six related bacterial species. Our Bayesian hierarchical clustering model 
is then used to create clusters of highly-similar motifs within our collection of discovered motif 
matrices. Since each motif contains sites discovered in close proximity to a particular B.subtilis 
gene, our motif clusters can be also interpreted as clusters of possibly co-regulated genes. These 
gene clusters show substantial evidence of co-regulation when compared to several sources of ex- 
ternal information about B.subtilis genes, such as microarray gene expression data and functional 
ontology. In addition to using the best partition of clusters, we were also able to incorporate our 
clustering variability in this application. Our cluster strength measure was again used to rank 
clusters, and several of the strongest clusters were also shown to be strong in terms of external 
evidence of co-regulation. Our individual motif strengths p(zi |z_j, Y) were also used to filter out 
weakly-associated motifs from our clustering results. 



6 Discussion 



Although we have presented our clustering procedure in the context of several specific applica- 
tions to motif matrices, these advantages of our Bayesian clustering model are not specific to this 
particular type of data. Bayesian hierarchical clustering models based on a Dirichlet process prior 
distribution should be considered an attractive approach to many clustering problems. Our hi- 
erarchical framework lets us account for uncertainty in the count matrices that represent each TF 
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motif by assuming a product multinomial distribution, whereas most clustering programs assume 
the count matrices are fixed and known without error. The second advantage is that our clustering 
strategy uses a model-based similarity measure rather than some ad hoc distance measure in order 
to compare motifs. At each iteration of the Gibbs sampling algorithm, the decision to cluster a par- 
ticular observation is determined by the conditional distribution of Z{ given all other information 
(z_j, Yj). Thus, our distance metric is exactly equal to the conditional posterior distribution under 
our full Bayesian clustering model. A third advantage of our clustering model allows not only the 
clusters themselves to vary (in terms of which motifs are members of which clusters) but also the 
number of clusters is allowed to vary. This is a key improvement over a clustering technique that 
requires the number of clusters to be fixed (such as K-means clustering) since, in this situation, 
we have very little idea a priori about how many motifs we might expect would be similar to each 
other. Standard hierarchical tree clustering is also less i deal in this situation, sinc e an arbitrary 
thresh old must be used to produce a set of clusters (eg. Kiel basa et al. j2005h and ISchones et al 
(200l)). Another general advantage of our procedure is that our posterior sampling implemen- 
tation gives us an idea of the variability of our clustering results, whereas traditional clustering 
methods typically give only a point estimate. We explore several summaries of this variability, 
including tree structures that summarize the pairwise clustering probability of our motifs, as well 
as measures of strength for entire clusters and individual motifs within clusters. However, fur- 
ther research is needed into effective techniques for analyzing stochastic clustering results, since 
the usual procedure of averaging across iterations is not appropriate when both cluster sizes and 
individual memberships within clusters vary between iterations. 

We also presented a novel extension of our model that allows the motif width within each cluster 
to vary, and our results indi cate that m any motifs have substantial motif width variability. Pre- 
vious methods, such as loin et al. (2003) may be ignoring important information by considering 
motif core widths to be fixed and known a priori. Our model also addresses the alignment issue 
that, within each raw motif matrix, it is not obvious where the core motif is located. Our model 
allows us to condition on the motif core in all other raw matrices within the current cluster when 
we calculate the most likely location of the motif core within a particular matrix. In many cases, 
other matrices may show very similar compositions to the matrix in question (especially matrices 
within the same cluster), in which case the conditioning provides a substantial amount of infor- 
mation pertaining to the motif core location. This extra information is ignored by methods (eg. 
ISchones et al\ ll2005^ whic h use fixed widths d uring their clustering procedure, and only partially 
captured by methods (eg. iKielbasa et al\ (2005)) which use widths estimated by pairwise compar- 
isons of matrices. 

As mentioned in SectionHJ our clustering results eliminate the redundancy within current matrix 
databases, which will benefit future motif discovery by reducing the number of redundant hits 
when scanning sequences for known transcription factors. However, it may be possible to utilize 
our clustering model for motif dis covery in a more s ophisticated way. A Bayesian framework for 
motif discovery was presented in Hensen et all |2004) based on a motif model where very little is 
known a priori about the appearance of an unknown motif. However, once a set of motifs has 
been discovered (and clustered), we should incorporate this information directly into our motif 
discovery procedure. One proposal would be to use the posterior predictive distribution from 
our motif clustering model as the scoring function for motif discovery, which would increase 
the ability of our motif-finding algorithms to detect a motif that is similar to motifs that have 
already been discovered elsewhere. In addition to using our best partition to aide the discovery 
of new transcription factor binding sites, this strategy would also allow us to utilize the clustering 
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uncertainty and variability in motif core widths and alignments which is estimated by our model. 
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Figure 5: 95% posterior intervals of each motif width in JASPAR matrices 
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Figure 6: Clustering statistics between uniform and DP models 
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